In this tutorial we'll build a new agent that implements the Categorical Deep Q Network (C51) algorithm (https://arxiv.org/pdf/1707.06887.pdf), and a preset that runs the agent on the 'Breakout' game of the Atari environment.
Implementing an algorithm typically consists of 3 main parts:
1. Implementing the network head used by the algorithm
2. Implementing the agent, which holds the algorithm logic
3. Defining a preset that runs the agent on some environment
The entire agent can be defined outside of the Coach framework, but Coach already provides multiple predefined agents under the agents directory, network heads under the architectures/tensorflow_components/heads directory, and presets under the presets directory, for you to reuse.
For more information, we recommend going over the following page in the documentation: https://nervanasystems.github.io/coach/contributing/add_agent/
We'll start by defining a new head for the neural network used by this algorithm - CategoricalQHead.
A head is the final part of the network. It takes the embedding from the middleware embedder and passes it through a neural network to produce the output of the network. There can be multiple heads in a network, and each one has an assigned loss function. The heads are algorithm dependent.
The rest of the network can be reused from the predefined parts, and the input embedder and middleware structure can also be modified, but we won't go into that in this tutorial.
The head will typically be defined in a new file - architectures/tensorflow_components/heads/categorical_dqn_head.py.
First - some imports.
In [ ]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
sys.path.append(module_path)
import tensorflow as tf
from rl_coach.architectures.tensorflow_components.heads.head import Head
from rl_coach.architectures.head_parameters import HeadParameters
from rl_coach.base_parameters import AgentParameters
from rl_coach.core_types import QActionStateValue
from rl_coach.spaces import SpacesDefinition
Now let's define the CategoricalQHead class. Each class in Coach has a complementary parameters class which defines its constructor parameters, so we will additionally define the CategoricalQHeadParameters class. The network structure should be defined in the _build_module function, which gets the previous layer's output as an argument. In this function, there are several variables that should be defined:

self.input - (optional) a list of any additional inputs to the head
self.output - the output of the head, which is also one of the outputs of the network
self.target - a placeholder for the targets that will be used to train the network
self.regularizations - (optional) any additional regularization losses that will be applied to the network
self.loss - the loss that will be used to train the network

Categorical DQN uses the same network as DQN, and only changes the last layer to output #actions x #atoms elements with a softmax function. Additionally, we update the loss function to cross entropy.
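To make this concrete, here is the quantity the head represents (a summary of the paper's formulation, not additional Coach code): for every action, the head outputs a probability vector over a fixed support of N atoms, and Q-values are recovered as the expectation over that support:

$$z_i = V_{\min} + i\,\Delta z, \qquad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}, \qquad i \in \{0, \dots, N - 1\}$$

$$Q(s, a) = \sum_{i=0}^{N-1} z_i \, p_i(s, a)$$

In the code below, N corresponds to self.num_atoms, and the softmax is taken over the atoms dimension.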
In [ ]:
class CategoricalQHeadParameters(HeadParameters):
def __init__(self, activation_function: str ='relu', name: str='categorical_q_head_params'):
super().__init__(parameterized_class=CategoricalQHead, activation_function=activation_function, name=name)
class CategoricalQHead(Head):
def __init__(self, agent_parameters: AgentParameters, spaces: SpacesDefinition, network_name: str,
head_idx: int = 0, loss_weight: float = 1., is_local: bool = True, activation_function: str ='relu'):
super().__init__(agent_parameters, spaces, network_name, head_idx, loss_weight, is_local, activation_function)
self.name = 'categorical_dqn_head'
self.num_actions = len(self.spaces.action.actions)
self.num_atoms = agent_parameters.algorithm.atoms
self.return_type = QActionStateValue
def _build_module(self, input_layer):
self.actions = tf.placeholder(tf.int32, [None], name="actions")
self.input = [self.actions]
values_distribution = tf.layers.dense(input_layer, self.num_actions * self.num_atoms, name='output')
values_distribution = tf.reshape(values_distribution, (tf.shape(values_distribution)[0], self.num_actions,
self.num_atoms))
# softmax on atoms dimension
self.output = tf.nn.softmax(values_distribution)
# calculate cross entropy loss
self.distributions = tf.placeholder(tf.float32, shape=(None, self.num_actions, self.num_atoms),
name="distributions")
self.target = self.distributions
self.loss = tf.nn.softmax_cross_entropy_with_logits(labels=self.target, logits=values_distribution)
tf.losses.add_loss(self.loss)
The agent will implement the Categorical DQN algorithm. Each agent has a complementary AgentParameters class, which allows selecting the parameters of the agent's sub-modules: the algorithm, the exploration policy, the memory, and the networks.
Now let's go ahead and define the network parameters - they will reuse the DQN network parameters, but the head parameters will be our CategoricalQHeadParameters. The network parameters class allows selecting any number of heads for the network by defining them in a list, but in this case we only have a single head, so we will point to its parameters class.
In [ ]:
from rl_coach.agents.dqn_agent import DQNNetworkParameters
class CategoricalDQNNetworkParameters(DQNNetworkParameters):
def __init__(self):
super().__init__()
self.heads_parameters = [CategoricalQHeadParameters()]
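As an aside, since heads_parameters is a list, a network can combine several heads, each with its own loss (weighted through the head's loss_weight argument). The snippet below is only an illustrative sketch that is not used in this tutorial; it assumes Coach's standard QHeadParameters class can be imported from rl_coach.architectures.head_parameters:

from rl_coach.architectures.head_parameters import QHeadParameters  # assumed import path

class TwoHeadedNetworkParameters(DQNNetworkParameters):
    def __init__(self):
        super().__init__()
        # a regular Q-value head alongside our categorical head; each head
        # contributes its own loss to the network's training objective
        self.heads_parameters = [QHeadParameters(), CategoricalQHeadParameters()]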
Next we'll define the algorithm parameters, which are the same as the DQN algorithm parameters, with the addition of the Categorical DQN specific parameters: v_min, v_max and the number of atoms.
We'll also define the parameters of the exploration policy, which is epsilon-greedy, with epsilon starting at a value of 1.0 and decaying to 0.01 over 1,000,000 steps.
In [ ]:
from rl_coach.agents.dqn_agent import DQNAlgorithmParameters
from rl_coach.exploration_policies.e_greedy import EGreedyParameters
from rl_coach.schedules import LinearSchedule
class CategoricalDQNAlgorithmParameters(DQNAlgorithmParameters):
def __init__(self):
super().__init__()
self.v_min = -10.0
self.v_max = 10.0
self.atoms = 51
class CategoricalDQNExplorationParameters(EGreedyParameters):
def __init__(self):
super().__init__()
self.epsilon_schedule = LinearSchedule(1, 0.01, 1000000)
self.evaluation_epsilon = 0.001
Now let's define the agent parameters class which contains all the parameters to be used by the agent - the network, algorithm and exploration parameters that we defined above, and also the parameters of the memory module to be used, which is the default experience replay buffer in this case.
Notice that the networks are defined as a dictionary, where the key is the name of the network and the value is the network parameters. This will allow us to later access each of the networks through self.networks[network_name].
The path property connects the parameters class to the class it parameterizes - in this case, the CategoricalDQNAgent class that we'll define in a moment.
In [ ]:
from rl_coach.agents.value_optimization_agent import ValueOptimizationAgent
from rl_coach.base_parameters import AgentParameters
from rl_coach.core_types import StateType
from rl_coach.memories.non_episodic.experience_replay import ExperienceReplayParameters
class CategoricalDQNAgentParameters(AgentParameters):
def __init__(self):
super().__init__(algorithm=CategoricalDQNAlgorithmParameters(),
exploration=CategoricalDQNExplorationParameters(),
memory=ExperienceReplayParameters(),
networks={"main": CategoricalDQNNetworkParameters()})
@property
def path(self):
return 'agents.categorical_dqn_agent:CategoricalDQNAgent'
The last step is to define the agent itself - CategoricalDQNAgent - which is a type of value optimization agent, so it will inherit the ValueOptimizationAgent class. It could have also inherited DQNAgent, which would result in the same functionality. Our agent will implement the learn_from_batch function, which updates the agent's networks according to an input batch of transitions.

Agents typically need to implement the training function - learn_from_batch - and a function that defines which action to select given a state - choose_action. In our case, we will reuse the choose_action function implemented by the generic ValueOptimizationAgent, and just update the internal function for fetching the q values of each of the actions - get_all_q_values_for_states.
This code may look intimidating at first glance, but it is basically just following the algorithm description in the Distributional DQN paper.
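Concretely, for each next-state atom the loop below computes the clipped Bellman update and splits its probability mass between the two nearest support atoms (this is a paraphrase of the paper's categorical projection, using the same notation as above):

$$\hat{\mathcal{T}} z_j = \Big[ r + \gamma (1 - done)\, z_j \Big]_{V_{\min}}^{V_{\max}}, \qquad b_j = \frac{\hat{\mathcal{T}} z_j - V_{\min}}{\Delta z}, \qquad l = \lfloor b_j \rfloor, \quad u = \lceil b_j \rceil$$

$$m_l \leftarrow m_l + p_j(s', a^*)\,(u - b_j), \qquad m_u \leftarrow m_u + p_j(s', a^*)\,(b_j - l)$$

where $a^* = \arg\max_a \sum_i z_i\, p_i(s', a)$ is the greedy action under the target network's predicted distribution. The resulting vector m is then used as the cross-entropy target for the action that was actually taken in the transition.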
In [ ]:
from typing import Union
import numpy as np
# Categorical Deep Q Network - https://arxiv.org/pdf/1707.06887.pdf
class CategoricalDQNAgent(ValueOptimizationAgent):
def __init__(self, agent_parameters, parent: Union['LevelManager', 'CompositeAgent']=None):
super().__init__(agent_parameters, parent)
self.z_values = np.linspace(self.ap.algorithm.v_min, self.ap.algorithm.v_max, self.ap.algorithm.atoms)
def distribution_prediction_to_q_values(self, prediction):
return np.dot(prediction, self.z_values)
# prediction's format is (batch,actions,atoms)
def get_all_q_values_for_states(self, states: StateType):
prediction = self.get_prediction(states)
return self.distribution_prediction_to_q_values(prediction)
def learn_from_batch(self, batch):
network_keys = self.ap.network_wrappers['main'].input_embedders_parameters.keys()
# for the action we actually took, the error is calculated by the atoms distribution
# for all other actions, the error is 0
distributed_q_st_plus_1, TD_targets = self.networks['main'].parallel_prediction([
(self.networks['main'].target_network, batch.next_states(network_keys)),
(self.networks['main'].online_network, batch.states(network_keys))
])
# only update the action that we have actually done in this transition
target_actions = np.argmax(self.distribution_prediction_to_q_values(distributed_q_st_plus_1), axis=1)
m = np.zeros((self.ap.network_wrappers['main'].batch_size, self.z_values.size))
batches = np.arange(self.ap.network_wrappers['main'].batch_size)
for j in range(self.z_values.size):
tzj = np.fmax(np.fmin(batch.rewards() +
(1.0 - batch.game_overs()) * self.ap.algorithm.discount * self.z_values[j],
self.z_values[self.z_values.size - 1]),
self.z_values[0])
bj = (tzj - self.z_values[0])/(self.z_values[1] - self.z_values[0])
u = (np.ceil(bj)).astype(int)
l = (np.floor(bj)).astype(int)
m[batches, l] = m[batches, l] + (distributed_q_st_plus_1[batches, target_actions, j] * (u - bj))
m[batches, u] = m[batches, u] + (distributed_q_st_plus_1[batches, target_actions, j] * (bj - l))
# total_loss = cross entropy between actual result above and predicted result for the given action
TD_targets[batches, batch.actions()] = m
result = self.networks['main'].train_and_sync_networks(batch.states(network_keys), TD_targets)
total_loss, losses, unclipped_grads = result[:3]
return total_loss, losses, unclipped_grads
Some important things to notice here:
- self.networks['main'] is a NetworkWrapper object. It holds all the copies of the 'main' network (the online network, the target network and, when training is distributed, the global network), and it exposes prediction functions such as predict and parallel_prediction. predict is quite straightforward - get some inputs, forward them through the network and return the output. parallel_prediction is an optimized variant of predict, which allows running a prediction on the online and target networks in parallel, instead of running them sequentially (see the short sketch after this list).
- The train_and_sync_networks function makes a single training step - running a forward pass of the online network, calculating the losses, running a backward pass to calculate the gradients, and applying the gradients to the network weights. If multiple workers are used, instead of applying the gradients to the online network weights, they are applied to the global (shared) network weights, and then the updated weights are copied back to the online network.
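For illustration, the parallel_prediction call in learn_from_batch could have been written as two sequential predict calls - a slower but equivalent sketch, reusing the same NetworkWrapper attributes that appear in the code above:

# equivalent sequential form of the parallel_prediction call (illustrative sketch)
distributed_q_st_plus_1 = self.networks['main'].target_network.predict(batch.next_states(network_keys))
TD_targets = self.networks['main'].online_network.predict(batch.states(network_keys))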
The final part is the preset, which will run our agent on some existing environment with any custom parameters. The new preset will typically be defined in a new file - presets/atari_categorical_dqn.py.
First - let's select the agent parameters we defined above. It is possible to modify internal parameters such as the learning rate.
In [ ]:
from rl_coach.agents.categorical_dqn_agent import CategoricalDQNAgentParameters
agent_params = CategoricalDQNAgentParameters()
agent_params.network_wrappers['main'].learning_rate = 0.00025
Now, let's define the environment parameters. We will use the default Atari parameters (frame skip of 4, taking the max over subsequent frames, etc.), and we will select the 'Breakout' game level.
In [ ]:
from rl_coach.environments.gym_environment import Atari, atari_deterministic_v4
env_params = Atari(level='BreakoutDeterministic-v4')
Connecting all the dots together - we'll define a graph manager with the Categorical DQN agent parameters, the Atari environment parameters, and the scheduling and visualization parameters.
In [ ]:
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.environments.gym_environment import atari_schedule
graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,
schedule_params=atari_schedule, vis_params=VisualizationParameters())
graph_manager.visualization_parameters.render = True
In [ ]:
# let the adventure begin
graph_manager.improve()
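graph_manager.improve() runs the full heatup/training/evaluation loop according to atari_schedule. For finer-grained control, the graph manager also exposes the individual phases; the sketch below is an assumption about that API (the heatup and train_and_act methods and the EnvironmentSteps step counter may differ between Coach versions):

from rl_coach.core_types import EnvironmentSteps  # assumed to be available here

# assumed API: run a short heatup phase to fill the replay buffer,
# then a small number of acting + training steps, instead of the full improve() loop
graph_manager.heatup(EnvironmentSteps(1000))
graph_manager.train_and_act(EnvironmentSteps(100))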